Efficient Query Optimization for Distributed Join in Database Federation
Author
Abstract
Database federation is one approach to data integration in which a middleware layer, called a mediator, provides uniform access to a number of heterogeneous data sources. The two key components of the mediator are the query rewriter and the query optimizer. This thesis focuses on the query optimizer, in particular on cost-based query optimization for distributed joins over a database federation. One important observation in query optimization over distributed database systems is that run-time conditions (namely available buffer size, CPU utilization on each machine, and the network environment) can significantly affect the execution cost of a query plan. However, very few studies of existing database federation systems have addressed run-time conditions. The problem is challenging because the mediator usually cannot observe the run-time conditions of remote sites, and taking them into account adds complexity to the optimizer. This thesis proposes the Cluster-and-Conquer algorithm for query optimization over a database federation, which takes run-time conditions into account efficiently. I first propose to view the whole federation as a clustered system by grouping data sources according to network infrastructure or enterprise boundaries, and then to provide each cluster of data sources with its own cluster mediator. Query optimization is divided into two procedures: the global mediator decides inter-cluster operations, and the cluster mediators handle the sub-queries within their clusters, taking run-time conditions into account. The algorithm has three benefits. First, the run-time conditions of machines are available to the cluster mediator, because communication within a cluster is time-efficient. Second, the cluster mediators can process their own sub-queries concurrently, so the complexity of processing the query plan is reduced. Third, the algorithm outperforms related approaches in terms of “cost of costing”, because it removes unnecessary inter-cluster operations at an early stage of query plan selection. I have implemented a prototype data federation system with the Cluster-and-Conquer algorithm. The experimental results show the capabilities and efficiency of the algorithm and identify the target scenarios in which it performs better than related approaches.
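The two-level division of work described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the catalog, the per-site load factors, the cost formula (row count multiplied by load), and the names `split_by_cluster`, `plan_cluster`, and `optimize` are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical catalog: table -> (cluster, site, estimated row count).
CATALOG = {
    "orders":    ("eu", "eu-1", 1_000_000),
    "customers": ("eu", "eu-2", 50_000),
    "parts":     ("us", "us-1", 200_000),
    "suppliers": ("us", "us-2", 10_000),
}

# Hypothetical run-time load per site (1.0 = idle). In this sketch only the
# cluster mediator reads it, mirroring the idea that run-time conditions are
# cheap to obtain inside a cluster.
SITE_LOAD = {"eu-1": 1.5, "eu-2": 1.0, "us-1": 1.0, "us-2": 3.0}


def split_by_cluster(tables):
    """Global mediator, step 1: group the queried tables by cluster."""
    groups = defaultdict(list)
    for table in tables:
        groups[CATALOG[table][0]].append(table)
    return dict(groups)


def plan_cluster(cluster, tables):
    """Cluster mediator: order local inputs by load-adjusted cost (assumed model)."""
    def cost(table):
        _, site, rows = CATALOG[table]
        return rows * SITE_LOAD[site]

    order = sorted(tables, key=cost)
    return {"cluster": cluster, "order": order, "cost": sum(cost(t) for t in order)}


def optimize(tables):
    """Global mediator: plan clusters concurrently, then order inter-cluster joins."""
    groups = split_by_cluster(tables)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(plan_cluster, c, ts) for c, ts in groups.items()]
        sub_plans = [f.result() for f in futures]
    # Assumed heuristic: combine the cheapest cluster results first.
    return sorted(sub_plans, key=lambda plan: plan["cost"])


if __name__ == "__main__":
    for plan in optimize(["orders", "customers", "parts", "suppliers"]):
        print(plan["cluster"], plan["order"], plan["cost"])
```

The thread pool stands in for cluster mediators working concurrently on their own sub-queries; a real federation would issue these as remote calls and use a far richer cost model.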
Similar resources
Communication-Efficient Implementation of Range-Joins in Sensor Networks
Sensor networks are multi-hop wireless networks of resource constrained sensor nodes used to realize high-level collaborative sensing tasks. To query and access data generated and stored at the sensor nodes, the sensor network can be looked upon as a distributed database. The unique characteristics of sensor networks such as limited memory and energy resources at each node make efficient execut...
Relational Databases Query Optimization using Hybrid Evolutionary Algorithm
Optimizing database queries is a hard research problem. Exhaustive search techniques such as dynamic programming are suitable for queries with a few relations, but as the number of relations in a query increases, they require too much memory and processing to remain practical, so random and evolutionary methods have to be used instead. The use of evolutionary methods, beca...
UPSP: Unique Predicate-based Source Selection for SPARQL Endpoint Federation
Efficient source selection is one of the most important optimization steps in federated SPARQL query processing as it leads to more efficient query execution plan generation. An over-estimation of the data sources will generate extra network traffic by retrieving irrelevant intermediate results. Such intermediate results will be excluded after performing joins between triple patterns. Consequen...
Dynamic Join Order Optimization for SPARQL Endpoint Federation
The existing web of linked data inherently has distributed data sources. A federated SPARQL query system, which queries RDF data via multiple SPARQL endpoints, is expected to process queries on the basis of these distributed data sources. During a federated query, each data source may consist of a search space of nontrivial size. Therefore, finding the optimal join order to minimize the size of...
Complex Query JOIN Optimization in Parallel Distributed Environment
The research work covers the query optimization concept in parallel distributed environment. The queries considered are select-project-join (SPJ) queries with large databases. The main query operation considered for research is JOIN operation of the query. For fast execution of a complex query, JOIN operation time needs to be minimized. Different JOIN operation algorithms such as Network Byte O...